| method | p | q | TPR | FPR | Error |
|---|---|---|---|---|---|
| ncagr | 25 | 50 | 0.947 | 0.002 | 4.249 |
| RegGMM | 25 | 50 | 0.822 | 0.003 | 4.767 |
| ncagr | 50 | 50 | 0.554 | 0.001 | 12.274 |
| RegGMM | 50 | 50 | 0.520 | 0.005 | 12.656 |
| ncagr | 25 | 100 | 0.978 | 0.001 | 3.775 |
| RegGMM | 25 | 100 | 0.756 | 0.001 | 5.098 |
Department of Statistics and Applied Probability, UC Santa Barbara
Signaling pathways associated with Glioblastoma multiforme (GBM)
Encodes the conditional independence structure of Gaussian random variables
Let \(\mathbf{X} \sim \mathcal{N}_p(\mathbf{0}, \Sigma)\) and define the precision matrix \(\Omega = \Sigma^{-1}\).
Key property: \(X_j \perp\!\!\!\perp X_k \mid X_{\setminus\{j, k\}} \iff \Omega_{jk} = 0\)
Estimating the conditional independence of \(p\) Gaussian random variables amounts to estimating the sparsity of the precision matrix \(\Omega\).
Estimated graph of \(p=73\) gene expressions related to GBM
We are concerned with quantitative trait loci (QTL) of a single nucleotide, or single nucleotide polymorphisms (SNPs).
Problem statement: Given data on gene expressions and additional genetic markers, identify the SNPs that mediate the co-expression of genes. Known as expression QTL (eQTL) analysis.
We must incorporate external covariate information into the graphical model.
Suppose in addition to responses \(\mathbf{X} \in \mathbb{R}^p\) we observe a vector \(\mathbf{U}\in\mathbb{R}^q\) of covariates.
Want to model the conditional distribution \(\mathbf{X} \mid \mathbf{U} = \mathbf{u} \sim \mathcal{N}_p(\mu(\mathbf{u}), \Sigma(\mathbf{u}))\) using a graphical model.
Many works allow the mean expression level to depend on covariates, but few allow the graph structure to be adjusted by covariates.
Zhang & Li (2022) propose an additive structure for the precision matrix:
\[\mu(\mathbf{u}) = \Gamma \mathbf{u},\quad \Sigma^{-1}(\mathbf{u}) = \Omega(\mathbf{u}) = B_0 + B_1u_1 + \dotsb + B_qu_q\]
\[ \begin{align*} \mathbf{X}\mid\mathbf{U} &=\mathbf{u} \sim \mathcal{N}_p(\mu(\mathbf{u}), \Omega^{-1}(\mathbf{u})) \\ \mu(\mathbf{u}) &= \Gamma\mathbf u \\ \Omega(\mathbf{u}) &= B_0 + B_1u_1 + \dotsb + B_qu_q \end{align*} \]
\(\Omega_{jk}\) is related to the partial correlation of \(X_j\) and \(X_k\). Estimated via nodewise regression (Meinshausen & Bühlmann ’06).
\[ X_j = \mathbf{u}^\top\boldsymbol{\gamma}_j + (\mathbf{X}_{-j} - \mathbf{u}^\top\mathbf{\gamma}_j)^\top \boldsymbol{\beta}_{j, 0} + \sum_{h=1}^q \underbrace{u_h \times (\mathbf{X}_{-j} - \mathbf{u}^\top\boldsymbol{\gamma}_j)^\top}_{\text{interaction term}} \boldsymbol{\beta}_{j,h} + \epsilon_j \]
This regression problem is not jointly convex in \(\boldsymbol{\gamma}_j\) and \(\boldsymbol{\beta}_j\).
Start with \(\mathbf{X} \sim \mathcal{N}_p(\mu, \Omega^{-1})\).
Define the natural parameters \(\mathbf{\theta} = \operatorname{diag}(\Omega)^{-1}\Omega\mathbf{\mu}\) and \(\Theta = -\operatorname{diag}(\Omega)^{-1}\Omega\).
It follows that \(X_j \mid \mathbf{X}_{-j} = \theta_j + \Theta_{j,-j}\mathbf{X}_{-j} + \epsilon_j, \quad \epsilon_j \sim \mathcal{N}(0, 1/\Omega_{jj})\).
Our model specified as
\[ \begin{align*} \mathbf{X}\mid\mathbf{U} &=\mathbf{u} \sim \mathcal{N}_p(\mu(\mathbf{u}), \Omega^{-1}(\mathbf{u})) \\ \theta(\mathbf{u}) &= \Gamma\mathbf u \\ \Theta(\mathbf{u}) &= B_0 + B_1u_1 + \dotsb + B_qu_q \end{align*} \]
with corresponding nodewise regression problem
\[X_j = \mathbf{u}^\top\boldsymbol{\gamma}_j + \mathbf{X}_{-j}^\top \boldsymbol{\beta}_{j,0} + \sum_{h=1}^q u_h \mathbf{X}_{-j}^\top \boldsymbol{\beta}_{j,h}+ \epsilon_j\]
which is jointly convex in \(\boldsymbol{\gamma}_j\) and \(\boldsymbol{\beta}_j\).
For each node \(j = 1, \dotsc, p\) we estimate
\[ \begin{align*} \gamma_j \in \mathbb{R}^q, \quad \beta_j = ( \underbrace{\beta_{j10},\dotsc, \beta_{jp0}}_{\text{population effects}}, \underbrace{\beta_{j11},\dotsc, \beta_{jp1}}_{\text{$u_1$ effects}}, \dotsc, \underbrace{\beta_{j1q},\dotsc, \beta_{jpq}}_{\text{$u_q$ effects}} )^\top \in \mathbb{R}^{(p-1)\times (q+1)} \end{align*} \]
Impose the sparse-group lasso penalty:
\[(\hat\gamma_j, \hat\beta_j) = \operatorname*{arg min}_{\gamma_j, \beta_j} \left\{\ell_2(\gamma_j, \beta_j) + \eta\lVert\gamma_j\rVert_1 + \lambda\lVert{\beta_j}\rVert_1 + \lambda_g \sum_{h=1}^q \lVert{\beta_{j,h}}\rVert_2\right\}\]
We ran our method on a data set of \(n=401\) GBM patients with \(p=73\) genes belonging to GBM pathways and \(q=118\) SNPs local to these genes. Highly connected genes include the calcium binding protein CALML5 and oncogenic transcription factor E2F3, which have roles in explaining GBM biology.
We ran our method on a data set of \(n=401\) GBM patients with \(p=75\) genes belonging to GBM pathways and \(q=118\) SNPs local to these genes. Highly connected genes include the calcium binding protein CALML5 and oncogenic transcription factor E2F3, which have roles in explaining GBM biology.
Estimated network effects of two SNPs
Left: Variant in BRAF mediates co-expressions of PDGFRA, PDGFB, and SOS2
Right: Variant in CALML4 mediates co-expressions of E2F1 and PIK3CA
Theorem 1 (\(\ell_2\) error): Under some regularity conditions and the assumption \((\color{red}{s_\beta} + \color{green}{s_\gamma})^2 \log(pq) = o(n)\), we have with high probability that \[\lVert{\hat{\gamma_j}-\gamma_j}\rVert_2^2 + \lVert{\hat{\beta_j}-\beta_j}\rVert_2^2 \precsim \frac{1}{n}\left(\color{green}{s_{\gamma}}\log(eq/\color{green}{s_\gamma}) + \color{red}{s_\beta}\log(ep) + \color{blue}{s_{\beta,g}}\log(eq/\color{blue}{s_{\beta,g}})\right)\]
\[ \begin{align*} \color{green}{s_\gamma} &= \text{number of non-zero elements in $\gamma_j$} \\ \color{red}{s_\beta} &= \text{number of non-zero elements in $\beta_j$} \\ \color{blue}{s_{\beta,g}} &= \text{number of non-zero groups in $\beta_j$} \end{align*} \]
Improved scaling assumption compared to existing methods
\[ (s_\beta + s_\gamma)^2 \log(pq) = o(n) \quad \text{vs} \quad \log(pq) = O(n^{1/6}),\, s_\gamma = o(n^{1/3}), \, s_\beta = o(n^{1/6}) \]
Our method provides support recovery under an additional signal strength condition.
Theorem 2 (support recovery): If it holds that \(\operatorname{min}_{\ell, k}\{|\gamma_\ell|, |\beta_k|\} \geq K(\color{blue}{\eta + \lambda + \lambda_g})\) then we have \[ \mathbb{P}(\color{green}{\hat S_\gamma = S_\gamma}, \color{red}{\hat S_\beta = S_\beta}) \geq 1 - C_1 \exp(-C_2 \log p) \]
\[ \begin{align*} \color{blue}{\eta, \lambda, \lambda_g} &: \text{sparse-group lasso penalty factors} \\ \color{green}{\hat S_\gamma, S_\gamma} &: \text{estimated and true support of $\gamma_j$} \\ \color{red}{\hat S_\beta, S_\beta} &: \text{estimated and true support of $\beta_j$} \end{align*} \]
We ran our method ncagr against a benchmark method on a variety of simulated data sets. We are interested in the true positive rate (TPR) and false positive rate (FPR) of detected edges as well as the overall error rate of the nodewise regressions.
| method | p | q | TPR | FPR | Error |
|---|---|---|---|---|---|
| ncagr | 25 | 50 | 0.947 | 0.002 | 4.249 |
| RegGMM | 25 | 50 | 0.822 | 0.003 | 4.767 |
| ncagr | 50 | 50 | 0.554 | 0.001 | 12.274 |
| RegGMM | 50 | 50 | 0.520 | 0.005 | 12.656 |
| ncagr | 25 | 100 | 0.978 | 0.001 | 3.775 |
| RegGMM | 25 | 100 | 0.756 | 0.001 | 5.098 |
Zhang, J. and Li, Y. (2022). High-dimensional gaussian graphical regression models with covariates. Journal of the American Statistical Association 0, 1–13
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics 34, 1436 – 1462
Fehrmann, R. S. N., et al (2011). Trans-eqtls reveal that independent genetic variants associated with a complex phenotype converge on intermediate genes, with a major role for the hla. PLoS Genetics 7, e1002197